Bootstrapping Method for Chunk Alignment in Phrase Based SMT
Abstract
The processing of the parallel corpus plays a crucial role in improving the overall performance of Phrase-Based Statistical Machine Translation (PB-SMT) systems. This paper studies the automatic alignment of different kinds of chunks, which improves both word alignment and machine translation quality. Single-tokenization of noun-noun MWEs, phrasal prepositions (source side only) and reduplicated phrases (target side only), together with the alignment of named entities and complex predicates, provides the best SMT model for bootstrapping. Automatic bootstrapping on the alignment of various chunks yields significant gains over the previous best English-Bengali PB-SMT system. The source chunks are translated into the target language using the PB-SMT system, and the translated chunks are compared with the original target chunks. The aligned chunks are added to the parallel corpus, increasing its size. The process is run in a bootstrapping manner until all the source chunks have been aligned with target chunks or no new chunk alignment is identified. The proposed system achieves significant improvements on an English-Bengali translation task: 2.25 BLEU points over the previous best system and 8.63 BLEU points absolute over the baseline system, a 98.74% relative improvement over the baseline.
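The bootstrapping loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how translated chunks are compared with target chunks, so the Dice token overlap and the `threshold` value are assumptions, and `train_toy` is a hypothetical word-for-word lexicon that merely stands in for retraining a real PB-SMT system.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Translator = Callable[[str], str]


def overlap(a: str, b: str) -> float:
    """Dice coefficient over token sets (assumed similarity measure)."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def train_toy(corpus: List[Pair]) -> Translator:
    """Hypothetical stand-in for PB-SMT training: a position-wise lexicon."""
    lexicon = {}
    for src, tgt in corpus:
        s, t = src.split(), tgt.split()
        if len(s) == len(t):
            lexicon.update(zip(s, t))
    # Unknown words are copied through unchanged.
    return lambda chunk: " ".join(lexicon.get(w, w) for w in chunk.split())


def bootstrap_align(
    corpus: List[Pair],
    src_chunks: List[str],
    tgt_chunks: List[str],
    train: Callable[[List[Pair]], Translator],
    threshold: float = 0.5,  # assumed cut-off, not from the paper
) -> Tuple[List[Pair], List[Pair]]:
    corpus = list(corpus)
    unaligned_src = list(src_chunks)
    unaligned_tgt = list(tgt_chunks)
    aligned: List[Pair] = []
    # Stop when every source chunk is aligned or a round finds nothing new.
    while unaligned_src:
        translate = train(corpus)  # retrain on the grown corpus
        new_pairs: List[Pair] = []
        for src in unaligned_src:
            if not unaligned_tgt:
                break
            hyp = translate(src)
            best = max(unaligned_tgt, key=lambda t: overlap(hyp, t))
            if overlap(hyp, best) >= threshold:
                new_pairs.append((src, best))
                unaligned_tgt.remove(best)
        if not new_pairs:
            break
        for src, _ in new_pairs:
            unaligned_src.remove(src)
        aligned.extend(new_pairs)
        corpus.extend(new_pairs)  # aligned chunks enlarge the parallel corpus
    return aligned, corpus


# Placeholder tokens only; lowercase is "source", uppercase is "target".
seed = [("a", "A"), ("b", "B")]
aligned, grown = bootstrap_align(
    seed, ["a b", "b c", "c d"], ["A B", "B C", "C D"], train_toy
)
print(aligned)
```

In this toy run the chunk "c d" cannot be aligned in the first round because the lexicon knows neither word; it is aligned only in the second round, after the pair ("b c", "B C") has been added to the corpus and the model retrained.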
Publication date: 2012